In [1]:
from ggplot import *
import pandas as pd
import numpy as np

In [ ]:
%matplotlib inline

In [3]:
df = pd.read_csv("./baseball-pitches-clean.csv")
df = df[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type', 
         'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']]
df.head()


Out[3]:
                  pitch_time  inning       pitcher_name    hitter_name  pitch_type      px     pz  pitch_name  start_speed  end_speed  type_confidence
0  2013-10-01 20:07:43 -0400       1  Francisco Liriano  Shin-Soo Choo           B   0.628  1.547    Fastball         93.2       85.3            0.894
1  2013-10-01 20:07:57 -0400       1  Francisco Liriano  Shin-Soo Choo           S   0.545  3.069    Fastball         93.4       85.6            0.895
2  2013-10-01 20:08:12 -0400       1  Francisco Liriano  Shin-Soo Choo           S   0.120  1.826      Slider         89.1       82.8            0.931
3  2013-10-01 20:08:31 -0400       1  Francisco Liriano  Shin-Soo Choo           S  -0.229  1.667      Slider         90.0       83.3            0.926
4  2013-10-01 20:09:09 -0400       1  Francisco Liriano   Ryan Ludwick           B  -1.917  0.438      Slider         87.7       81.6            0.915

5 rows × 11 columns

Getting a feel for what's going on

geom_point

I usually start by making some really simple plots like scatterplots and histograms just to make sure that things make sense.

px and pz are the horizontal and vertical coordinates of a pitch as it crosses home plate. Let's plot these and see if our data makes sense.


In [3]:
ggplot(df, aes(x='px', y='pz')) + geom_point()


Out[3]:
<ggplot: (272839901)>

What about the pitch speed?


In [4]:
ggplot(aes(x='start_speed', y='end_speed'), data=df) + geom_point()


Out[4]:
<ggplot: (276734237)>
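
The scatterplot can be backed up with a quick numeric check. The sketch below assumes that start_speed is measured near release and end_speed near the plate, so every pitch should slow down. It uses only the five rows shown in the df.head() output above, copied by hand; the variable names are illustrative.

```python
# (start_speed, end_speed) pairs copied from the df.head() output above.
rows = [
    (93.2, 85.3),
    (93.4, 85.6),
    (89.1, 82.8),
    (90.0, 83.3),
    (87.7, 81.6),
]

# Speed lost between the two measurements, rounded to one decimal place.
drops = [round(start - end, 1) for start, end in rows]

# Sanity check: no pitch should arrive faster than it started.
all_slower = all(end < start for start, end in rows)

print(drops)
print(all_slower)
```

On the full dataset the same check could be applied to the start_speed and end_speed columns of df; any row that fails it is worth a closer look.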

geom_histogram

A better way to inspect pitch speed might be to look at a distribution of the data.

Does this make sense? Let's consult the source of all true wisdom: https://answers.yahoo.com/question/index?qid=20080126131031AAwVCNk


In [4]:
ggplot(df, aes(x='start_speed')) + geom_histogram()


stat_bin: binwidth defaulted to range/30.
    Use 'binwidth = x' to adjust this.
Out[4]:
<ggplot: (285457305)>
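
The stat_bin message says the bin width defaulted to the data range divided by 30. The sketch below works that number out for the five start_speed values from the df.head() output above, then counts values per bin for a hand-picked width. The bin_counts helper is hypothetical and only illustrates fixed-width binning; it is not ggplot's own bin-edge logic.

```python
from collections import Counter

# start_speed values copied from the df.head() output above.
speeds = [93.2, 93.4, 89.1, 90.0, 87.7]

# The default reported by stat_bin: range / 30.
default_binwidth = (max(speeds) - min(speeds)) / 30.0


def bin_counts(values, binwidth):
    """Count values in fixed-width bins starting at the minimum value.

    Hypothetical helper for illustration only.
    """
    start = min(values)
    counts = Counter(int((v - start) // binwidth) for v in values)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# A wider, hand-picked bin width (2 mph) gives three bins for this sample.
coarse = bin_counts(speeds, binwidth=2.0)

print(default_binwidth)
print(coarse)
```

As the message suggests, passing binwidth to geom_histogram should override the default when range/30 gives bins that are too fine or too coarse.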

What about for specific pitches?


In [5]:
# Draw one histogram per pitch name; print() renders each plot inside the loop.
for name, frame in df.groupby("pitch_name"):
    print(ggplot(aes(x='start_speed'), data=frame) + geom_histogram() + ggtitle("Distribution of " + str(name)))


<ggplot: (288278377)>
<ggplot: (285224941)>
<ggplot: (288277409)>
<ggplot: (289871437)>
<ggplot: (293071473)>
<ggplot: (292574941)>
<ggplot: (289870441)>
<ggplot: (291709497)>

That was helpful, but I'm sort of on plot overload now.

facet_wrap FTW

Use the trellis.

"Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets."


In [6]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_wrap('pitch_name')


/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ggplot-0.5.9-py2.7.egg/ggplot/ggplot.py:198: RuntimeWarning: Facetting is currently not supported with geom_bar. See
                    https://github.com/yhat/ggplot/issues/196 for more information
  warnings.warn(msg, RuntimeWarning)
Out[6]:
<ggplot: (292575897)>
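
Conceptually, faceting does the same job as the groupby loop earlier: split the rows by one column, then draw the same plot once per group, all in a single figure. A minimal sketch of the splitting step, using the five rows from the df.head() output above (the panels name is illustrative, not part of ggplot):

```python
from collections import defaultdict

# (pitch_name, start_speed) pairs copied from the df.head() output above.
rows = [
    ("Fastball", 93.2),
    ("Fastball", 93.4),
    ("Slider", 89.1),
    ("Slider", 90.0),
    ("Slider", 87.7),
]

# One list of speeds per pitch name; each list would feed one panel.
panels = defaultdict(list)
for pitch_name, start_speed in rows:
    panels[pitch_name].append(start_speed)

for pitch_name in sorted(panels):
    print(pitch_name, panels[pitch_name])
```

Each key corresponds to one panel title, and each list holds the values that panel's histogram would be built from.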

Changeup, Curveball, Cut Fastball, Eephus... Wait, what?


In [15]:
from IPython.display import YouTubeVideo
YouTubeVideo("ikLlRT2j7EQ")


Out[15]:

OK, so what about balls and strikes?


In [8]:
ggplot(aes(x='pitch_type'), data=df) + geom_bar()


Out[8]:
<ggplot: (275730281)>
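
With only an x aesthetic, the bar chart shows how many rows fall into each category. A rough equivalent of that counting step, using the pitch_type values from the five rows in the df.head() output above:

```python
from collections import Counter

# pitch_type values copied from the df.head() output above.
pitch_types = ["B", "S", "S", "S", "B"]

# One bar per distinct value; the bar height is the number of rows.
bar_heights = Counter(pitch_types)

for pitch_type in sorted(bar_heights):
    print(pitch_type, bar_heights[pitch_type])
```

On the full DataFrame, df['pitch_type'].value_counts() gives the same kind of tally without plotting.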

facet_grid

(facet_wrap's brother)


In [9]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_type')


Out[9]:
<ggplot: (276653609)>

In [12]:
ggplot(aes(x='start_speed'), data=df) +\
    geom_histogram() +\
    facet_grid('pitch_name', 'pitch_type', scales="free")


Out[12]:
<ggplot: (271338625)>

geom_density

Similar to geom_histogram, but the y-axis shows relative density instead of raw counts.


In [13]:
ggplot(df, aes(x='start_speed')) +\
    geom_density()


Out[13]:
<ggplot: (275662825)>
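
A density curve of this kind is usually a kernel density estimate: each observation contributes a small bump, and the bumps are scaled so the total area under the curve is 1. The sketch below is a hand-rolled Gaussian version with an arbitrary bandwidth of 1.0, applied to the five start_speed values from the df.head() output above. It is meant to show the idea, not to reproduce ggplot's implementation or its bandwidth choice.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x.

    Hand-rolled for illustration; the bandwidth is chosen by the caller.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# start_speed values copied from the df.head() output above.
speeds = [93.2, 93.4, 89.1, 90.0, 87.7]
density = gaussian_kde(speeds, bandwidth=1.0)

# Approximate the area under the curve with a Riemann sum from 80 to 100 mph.
step = 0.01
grid = [80.0 + i * step for i in range(2001)]
area = sum(density(x) * step for x in grid)

print(round(area, 3))
```

Because the area is fixed at 1, density curves for groups of very different sizes stay comparable on one axis, which is what makes coloring by pitch_name useful.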

In [14]:
ggplot(df, aes(x='start_speed', color='pitch_name')) +\
    geom_density()


Out[14]:
<ggplot: (278182857)>

In [ ]: